Semi‐supervised classification of fundus images combined with CNN and GCN

Abstract
Purpose: Diabetic retinopathy (DR), a fundus lesion with specific changes, is one of the most serious complications of diabetes. Early diagnosis of DR can effectively reduce the visual damage it causes. Because DR lesions vary widely in type and morphology, automatic classification of fundus images in mass screening can greatly reduce clinicians' diagnosis time. To address this need, we propose a novel framework, the graph attentional convolutional neural network (GACNN).
Methods and Materials: The network consists of a convolutional neural network (CNN) and a graph convolutional network (GCN). The global and spatial features of fundus images are extracted by the CNN and the GCN, and an attention mechanism is introduced to enhance the adaptability of the GCN to the topology map. We adopt a semi‐supervised method for classification, which greatly improves the generalization ability of the network.
Results: To verify the effectiveness of the network, we conducted comparative experiments and ablation experiments. We use the confusion matrix, precision, recall, kappa score, and accuracy as evaluation indexes. Classification accuracy increases with the labeling rate. In particular, when the labeling rate is set to 100%, the classification accuracy of GACNN reaches 93.35%, an improvement of 6.24% over DenseNet121.
Conclusions: Semi‐supervised classification based on the attention mechanism can effectively improve the classification performance of the model and attains favorable results in classification indexes such as accuracy and recall. GACNN provides a feasible classification scheme for fundus images and effectively reduces the human resources needed for screening.


INTRODUCTION
Diabetic retinopathy (DR) is a microvascular disease of the retina that occurs in patients with diabetes. It is mainly pathologically characterized by retinal vascular changes. In the fundus it mostly manifests as retinal exudation and edema, neovascularization, hemorrhage, and the formation of proliferating membrane.
Normal fundus images mainly contain arteries, veins, the macula, the optic disc, and other structures, while common abnormal lesions in DR images include microaneurysms, hemorrhagic spots, white spots, cotton wool spots, neovascularization, and so forth. Microaneurysms appear as red round spots on fundus images and reflect changes in vascular performance at an early stage. Hemorrhagic spots are caused by blood leaking from blood vessels into the retina and appear as dark red spots or larger patches on fundus images. The formation of white spots is related to the accumulation of fatty tissue caused by retinal neuropathy. Leukoplakia are bright white plaques caused by nutrients such as lipids and proteins leaking from blood vessels into the retina. The formation of cotton wool spots is related to focal ischemia and necrosis of nerve tissue. Neovascularization arises when ischemia caused by vascular obstruction leads the retina to generate small, disordered new vessels. Normal fundus images and DR fundus images are shown in Figure 1.
Clinically, according to the presence or absence of retinal neovascularization, DR is divided into nonproliferative diabetic retinopathy (NPDR) (also called simple or background type) and proliferative diabetic retinopathy (PDR). 1 NPDR is divided into three stages:
(1) Stage I: Microvascular tumors appear in the fundus.
(2) Stage II: White spots with clear edges and irregular shapes appear in the fundus.
(3) Stage III: Cotton flocculent leukoplakia appears in the fundus.
As shown in Figure 2, DR images are divided into five grades according to severity: normal, mild NPDR, moderate NPDR, severe NPDR, and PDR.
Because DR lesions are complex and their pathological manifestations differ, identification by doctors alone is extremely inefficient and consumes a lot of medical resources. The purpose of this study is to classify DR images by extracting pathological features from them.
In recent years, deep learning has developed rapidly and has achieved remarkable results in image processing and semantic recognition. Nayak et al. 2 classified fundus images into three categories (normal, proliferative, and non-proliferative) by using exudate and vascular regional texture combined with a neural network. Adarsh et al. 3 used image processing techniques to identify blood vessels, exudates, microaneurysms, and other lesions with texture features in retinal images in order to classify DR lesions. Prentasic 4 used a convolutional neural network (CNN) to detect exudates in color fundus images, which improved the performance of DR diagnosis. Liu et al. 5 classified optical coherence tomography (OCT) images by extracting texture features.
Li et al. 6 trained the visual geometry group 16 (VGG-16) model with deep transfer learning. They detected and classified age-related macular degeneration (AMD) and diabetic macular edema (DME) from OCT images and showed better performance in retinal image classification. Schlegl et al. 7 used an encoder-decoder CNN framework to detect intraretinal cystoid fluid (IRC) and subretinal fluid (SRF), achieving a classification accuracy of 0.94 on OCT images. Christopher et al. 8 evaluated and compared the performance of the VGG-16, Inception-v3, and ResNet models, using transfer learning to improve the detection of glaucomatous optic neuropathy (GON) and to accelerate the convergence of the model.
In order to better implement the deep learning mechanism, Shanthi et al. 9 proposed an improved AlexNet architecture for DR classification, which improved the classification performance. Shankar et al. 10 proposed a new automated hyperparameter tuning Inception-v4 (HPTI-v4) model for DR classification of color fundus images.
FIGURE 1 Comparison of normal retina and diabetic retinopathy images
FIGURE 2 Diabetic retinopathy images severity grading
Deep neural networks are effective at feature extraction. In addition, because the different lesion types of retinopathy interact with one another, we use a graph convolutional network (GCN) to describe this relationship and combine it with the image features extracted by the CNN to classify the images.
The concept of the graph neural network (GNN) was first proposed in 2005. In 2009, Scarselli et al. 11 established the theoretical basis of GNNs. The GCN, a branch of the GNN family, is the most commonly used variant. In 2013, building on graph signal processing, Bruna et al. 12 first proposed CNNs on graphs in both a frequency-domain form and a spatial-domain form. In fact, graph convolution based on the frequency domain can be regarded as a special case of the spatial method.
GCNs have been widely used in small sample learning, 13 point clouds, 14 image classification, 15 and other tasks, and have achieved good results. Ma et al. 16 designed a pooling operator based on the Fourier transform and combined it with a GCN for graph classification, achieving good performance on multiple datasets. Lin et al. 17 used a method based on GCN and self-supervised learning to classify fundus images.
To enhance the feature extraction capability of a network, the attention mechanism has been introduced in many tasks; it makes the network focus on important information and reduces information loss. Bahdanau et al. 18 applied the attention mechanism to the field of natural language processing (NLP) for the first time. Veličković et al. 19 proposed graph attention networks (GAT), which apply the attention mechanism to GCNs and effectively improve the generalization ability of the model.
Semi-supervised learning is a hot topic in the field of medical image research. It combines labeled data and unlabeled data for learning. The basic idea is to use unlabeled data to optimize the model established from labeled data. Common semi-supervised learning methods include graph-based methods, co-training methods, generative methods, and so on. Qiao et al. 20 combined co-training with deep networks, regarded multiple subnetworks as multi-view networks, and used adversarial samples to improve view diversity. Ghorbani 21 used a semi-supervised approach to adjust the loss function and proposed an end-to-end learning method for semi-supervised classification. Sohn et al. 22 designed a semi-supervised learning framework combining self-training and consistency regularization for object detection. A teacher model is trained to generate pseudo labels for unlabeled images, and the pseudo-labeled images are then augmented to update the model.
On the basis of previous work, we propose a new model, which combines CNN and GCN to enhance the ability to capture image information, and uses attention mechanism in the graph convolution module to reduce information loss.
Our main contributions can be summarized as follows:
1. To improve classification accuracy, our network uses both a CNN and a GCN to learn the features and node relationships of multi-label fundus images. The network uses a max-pooling layer so that the image retains more detailed information.
2. The graph convolution module based on the attention mechanism can effectively capture important node information of fundus images, extract structural features, and construct node sequences.
3. We use semi-supervised learning to improve the learning performance, address the weak generalization and inaccuracy of the model, and improve the classification performance on fundus images.

Overview
This section describes our network in detail. The network is built on an improved VGG-19 23 and a GCN. 24 The network structure is illustrated in Figure 3 and consists of three parts: the CNN module, the GCN module based on the attention mechanism, and the classifier. In the CNN branch, we use a modified VGG-19 network to extract image features and output a feature vector. In the GCN module, the attention mechanism is introduced to extract valuable features from every hop of the graph convolutional layers, which reduces noise and redundancy from the input and captures node features. The attention mechanism also helps solve the problem of selecting the order of graph nodes. The attention graph convolutional (AGC) module uses the attention mechanism to replace the fixed normalization operations in the graph convolution. Finally, the information captured by the two parts is fused and fed into the classifier for classification.

CNN module
As shown in Figure 3, our CNN module is a refinement of VGG-19. As shown in Table 1, we delete the max-pooling layer in the last convolutional block of VGG-19 and keep the other max-pooling layers and all convolutional layers in the model. Finally, we replace the three subsequent fully connected layers with a max-pooling layer. 25 The output of the last convolutional layer of the CNN module is a three-dimensional feature map of size 14 × 14 × 512, which the max-pooling layer transforms into a feature vector of length 512. The max-pooling layer retains the maximum value of each region and preserves highly discriminative features, which can reduce the migration error caused by the convolutional layer parameters. Max-pooling helps ensure the position and rotation invariance of features and retains more texture features of images. In addition, it reduces the number of parameters and compresses the size of the model.
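The pooling step that turns the 14 × 14 × 512 feature map into a 512-dimensional vector can be illustrated with a short sketch. It assumes that the maximum is taken over the whole spatial grid of each channel (global max-pooling); the function name `global_max_pool` and the random input are hypothetical and are not taken from the original implementation.

```python
import random


def global_max_pool(feature_map):
    """Collapse an H x W x C feature map (nested lists) into a length-C vector
    by taking the maximum over all spatial positions of each channel."""
    channels = len(feature_map[0][0])
    pooled = [float("-inf")] * channels
    for row in feature_map:
        for cell in row:
            for ch in range(channels):
                if cell[ch] > pooled[ch]:
                    pooled[ch] = cell[ch]
    return pooled


# Hypothetical 14 x 14 x 512 feature map filled with random activations.
random.seed(0)
fmap = [[[random.random() for _ in range(512)] for _ in range(14)]
        for _ in range(14)]
vector = global_max_pool(fmap)
print(len(vector))  # 512
```

Because the pooling has no trainable weights, replacing the fully connected layers with it removes their parameters entirely, which is consistent with the reduction in model size described above.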

Graph construction
We employ a graph to describe the correlation between various lesions and use the speeded up robust features (SURF) algorithm 26 to detect the possible lesion regions of DR images. Figure 3 shows the feature points detected in fundus images by the SURF algorithm. The detected objects are represented as graph nodes, each of which is connected to all other nodes except itself. Let $y_i$ denote the spatial center coordinate vector of node $i$. The region information of each node is defined as $\{m_i, n_i, w_i, h_i\}$, where $m_i$ and $n_i$ are the upper-left coordinates of the node, and $w_i$ and $h_i$ are the width and height of the region, respectively. Node $i$ is described by this region information. The connection weight between nodes $i$ and $j$ is calculated with a formula whose parameter is set to 1.6. $N_k(j)$ denotes the set of $k$-nearest neighbors (KNN) 27 of vertex $j$. We use the Euclidean distance to measure the distance between two nodes and set $K$ to 8. When nodes $i$ and $j$ each lie within the other's KNN set, the two nodes are judged to be connected.
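The mutual-KNN connection rule can be sketched as follows. This is only an illustration under stated assumptions: the weight formula is not recoverable from the text, so a Gaussian kernel on the Euclidean distance between region centers is assumed here, with the parameter 1.6 used as its scale (`sigma`). The names `box_center`, `knn_sets`, and `mutual_knn_weights` are hypothetical.

```python
import math


def box_center(m, n, w, h):
    """Center of a region given its upper-left corner (m, n), width w, height h."""
    return (m + w / 2.0, n + h / 2.0)


def knn_sets(centers, k):
    """For every node, the index set of its k nearest neighbors (Euclidean)."""
    result = []
    for i, ci in enumerate(centers):
        ranked = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i
        )
        result.append({j for _, j in ranked[:k]})
    return result


def mutual_knn_weights(boxes, k=8, sigma=1.6):
    """Weight matrix that is nonzero only for mutually nearest node pairs."""
    centers = [box_center(*box) for box in boxes]
    neighbors = knn_sets(centers, k)
    n = len(centers)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:  # both nodes are in each other's KNN set
                d = math.dist(centers[i], centers[j])
                # Assumed Gaussian kernel; the source formula is not shown.
                weights[i][j] = math.exp(-(d * d) / (2.0 * sigma * sigma))
    return weights


# Three hypothetical regions (m, n, w, h); k=1 keeps the toy example readable.
boxes = [(0, 0, 2, 2), (1, 0, 2, 2), (10, 10, 2, 2)]
w = mutual_knn_weights(boxes, k=1)
```

In this toy example the first two regions are each other's nearest neighbor and receive a positive weight, while the third region's nearest neighbor does not reciprocate, so it stays unconnected.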

Graph convolutional network
The graph in a GCN refers to the topological graph used in mathematics, in which vertices and edges establish the corresponding relationships. A GCN operates directly on a graph and outputs the embedding vector of each node according to the properties of the node's neighborhood. 28 For a graph $G$, $X \in \mathbb{R}^{n \times m}$ is defined as the feature matrix, where $n$ is the number of nodes, $m$ denotes the feature dimension, and $x_i$ is the feature vector of node $v_i$. $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of graph $G$, which represents the adjacency relationship between any two vertices: an entry is 1 if the two vertices are adjacent and 0 otherwise. $D$ is the degree matrix; each element on its diagonal is the degree of the corresponding vertex, that is, the number of vertices linked to it. The output features of a single GCN layer are computed with an activation function such as ReLU(x) = max(0, x). Applying this one-layer convolution repeatedly gives the multi-layer graph convolution with depth $j$.
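A single graph convolution and its multi-layer stacking can be sketched as below. The exact propagation rule is not recoverable from the text, so this sketch assumes the widely used symmetric normalization $D^{-1/2} A D^{-1/2}$ applied to a symmetric adjacency matrix, followed by a weight matrix and ReLU. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Assumed normalization D^{-1/2} A D^{-1/2}; degrees are row sums of A."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def gcn_layer(adj, feats, weight):
    """One layer: ReLU(normalized adjacency x features x weight)."""
    propagated = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in propagated]


def gcn_stack(adj, feats, weights):
    """Apply the layer once per weight matrix and keep every hop's output."""
    hops = []
    h = feats
    for w in weights:
        h = gcn_layer(adj, h, w)
        hops.append(h)
    return hops


adj = [[0, 1], [1, 0]]                    # two mutually connected nodes
feats = [[1.0, 2.0], [3.0, 4.0]]          # n = 2 nodes, m = 2 features
weight = [[1.0, 0.0], [0.0, -1.0]]        # hypothetical layer weights
out = gcn_layer(adj, feats, weight)
```

Keeping the output of every hop, as `gcn_stack` does, is what makes it possible to combine information from several depths later instead of using only the final layer.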

Attention module
In the graph convolutional layer, a convolutional operation is performed on the neighbors of each node, and the node is updated with the result of the convolution. The result then passes through the activation function and a further convolutional layer, and this is repeated until the number of layers reaches the desired depth.
In this process, the convolutional results of each layer can only be used for the next convolutional operation, resulting in a large amount of information loss. 29 To reduce this loss and suppress useless information, we use the attention mechanism to extract important information in each convolutional operation and construct the AGC layer shown in Figure 4. Because different convolutional layers carry different structural information, we use the attention mechanism to aggregate the important information from each convolutional step and obtain the combination of nodes across layers. In this combination, $a_i$ is the attention weight of each hop, and $H^j_v$ represents the structural feature of node $v$ at hop $j$. Graph $G$ can then be represented by its combined nodes.
FIGURE 4 Attention graph convolutional layer
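The idea of weighting every hop instead of keeping only the last one can be sketched as follows. The combination formula is not recoverable from the text, so the sketch assumes a weighted sum of per-hop node features, with attention weights obtained by a softmax over one score per hop. The scores, which would be learned in practice, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_hops(hop_features, hop_scores):
    """Combine per-hop node features with one attention weight per hop.

    hop_features[j][v] is the feature vector of node v at hop j.
    """
    attention = softmax(hop_scores)
    num_nodes = len(hop_features[0])
    dim = len(hop_features[0][0])
    combined = [[0.0] * dim for _ in range(num_nodes)]
    for a, feats in zip(attention, hop_features):
        for v in range(num_nodes):
            for d in range(dim):
                combined[v][d] += a * feats[v][d]
    return combined


# One node, two hops, equal scores: both hops contribute with weight 0.5.
hop_features = [[[2.0, 0.0]], [[0.0, 4.0]]]
combined = aggregate_hops(hop_features, [0.0, 0.0])
```

With this form, a hop whose score is much lower than the others contributes almost nothing, which matches the stated goal of suppressing useless information.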

Loss function
Our model adopts a semi-supervised approach for classification. 30 First, the labeled samples are used to calculate the parameters, and then the labels of unlabeled nodes are obtained by forward propagation.
We suppose that $f_k$ represents the features extracted by the CNN module and $h_k$ represents the information extracted by the GCN module. The features extracted by the two branches are fused by element-wise addition (⊕), using the weight matrix of the fully connected layer and a coefficient that adjusts the fusion ratio of the features of the two modules. In our experiment, the value of this coefficient is 1. The training set is randomly divided into labeled data and unlabeled data, and both kinds of data are used for training. During training, after every weight update, the predictions for the unlabeled data are regarded as pseudo labels. We denote the labeled dataset as $S$ and the unlabeled dataset as $U$. A labeled image $S_i \in S$ has the ground truth label $P_i$, and an unlabeled image $U_j \in U$ has the pseudo label $P_j$. $Z$ represents the final prediction label. We use the cross-entropy loss function in training and weight the losses of labeled and unlabeled samples to optimize the model.
For labeled data, the loss in formula (7) is the cross-entropy over the $K$ categories, where $K$ represents the number of categories. For unlabeled samples, the loss is computed in the same way against the pseudo labels. The total loss of our model in formula (9) combines the two terms through a hyperparameter that balances them; its value is set to 0.1 in this experiment.
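The weighted two-term loss can be illustrated with a small sketch. It assumes hard pseudo labels (the arg-max of the current prediction), mean reduction over each set, and a total of the form labeled loss plus weight times unlabeled loss; the parameter name `lam` and the function names are hypothetical, and the default 0.1 is the value reported for this experiment.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy of one predicted distribution against a class index."""
    return -math.log(max(probs[target], 1e-12))


def semi_supervised_loss(labeled_probs, labels, unlabeled_probs, lam=0.1):
    """Labeled cross-entropy plus lam times the pseudo-label cross-entropy."""
    supervised = sum(
        cross_entropy(p, y) for p, y in zip(labeled_probs, labels)
    ) / len(labels)
    # Assumed hard pseudo labels: the class with the highest predicted score.
    pseudo_labels = [max(range(len(p)), key=p.__getitem__)
                     for p in unlabeled_probs]
    if pseudo_labels:
        unsupervised = sum(
            cross_entropy(p, y) for p, y in zip(unlabeled_probs, pseudo_labels)
        ) / len(pseudo_labels)
    else:
        unsupervised = 0.0
    return supervised + lam * unsupervised, pseudo_labels


loss, pseudo = semi_supervised_loss(
    labeled_probs=[[0.5, 0.5]], labels=[0],
    unlabeled_probs=[[0.25, 0.75]],
)
```

Setting `lam` to 0 reduces the total to the labeled term alone, which corresponds to training without any contribution from the unlabeled data.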

Dataset
Our dataset is the public dataset of the APTOS 2019 Blindness Detection challenge, which contains training images and test images. The dataset is 9.51 GB in size and contains 5590 images, of which 3662 are used for training and 1928 for testing. It also contains train.csv and test.csv, which hold the label of each picture in the training set and test set, respectively. The aim of the competition is to grade the severity of DR, where 0 is normal and 4 is the most severe. The purpose of this study is to use artificial intelligence methods to diagnose DR as early as possible and avoid deterioration of the disease. The data have several obvious problems: for example, the images differ in size, color, and brightness. Training directly on the original images increases the difficulty of training and makes the features of lesions hard to find, so the images need to be preprocessed.
There are black areas on the edges of the original fundus images. The black areas are meaningless for classification, so the low-pixel-value parts at the image edges are removed. The image is cropped by taking the diameter of the fundus as the side length and the center of the eye as the image center. In the original image, the black areas only exist on the left and right sides of the image, so it only needs to be cropped in the vertical direction. The image size is adjusted to 224 × 224. For images with color deviation, oversaturation, or undersaturation, the ImageEnhance module of the PIL library is used to perform brightness equalization, color balance adjustment, and contrast adjustment, so that different images look more consistent and their features are highlighted. After that, we subtract a Gaussian-blurred 31 copy from the image and use the difference to enhance the image.
As shown in Table 2, the distribution of the sample classes in the dataset is unbalanced. We adopt data augmentation methods such as flipping, rotation, and shifting to increase the number of samples in each class to 1800 and balance the distribution of the categories.

Evaluation measures
The evaluation indexes of the experiments include the confusion matrix, precision, recall, kappa score, and accuracy. The accuracy represents the proportion of correctly classified samples among all samples and can be expressed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (10)$$

where TP represents the true positives, TN represents the true negatives, FP represents the false positives, and FN represents the false negatives. The precision represents the proportion of correctly predicted images among all images predicted as positive, while the recall is the proportion of correctly predicted images among all images labeled as positive. The precision, recall, and kappa score can be expressed as:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (11)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (12)$$

$$\text{Kappa} = \frac{p_0 - p_e}{1 - p_e} \quad (13)$$

where $p_0$ is the accuracy. Suppose that the true sample numbers of each class are $a_1, a_2, \ldots, a_K$, while the predicted sample numbers of each class are, respectively, $b_1, b_2, \ldots, b_K$, and that the total number of samples is $n$. Then $p_e = \dfrac{a_1 b_1 + a_2 b_2 + \cdots + a_K b_K}{n \times n}$.
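These indexes can all be derived from a confusion matrix, as the following sketch shows. It uses the standard definitions of per-class precision and recall and of Cohen's kappa; the function name, the row/column convention (rows are true classes, columns are predicted classes), and the example counts are hypothetical.

```python
def classification_metrics(confusion):
    """Accuracy, per-class precision and recall, and kappa from a confusion
    matrix where confusion[t][p] counts true class t predicted as class p."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(k))
    accuracy = correct / total

    true_counts = [sum(confusion[c]) for c in range(k)]
    pred_counts = [sum(confusion[t][c] for t in range(k)) for c in range(k)]

    precision = [confusion[c][c] / pred_counts[c] if pred_counts[c] else 0.0
                 for c in range(k)]
    recall = [confusion[c][c] / true_counts[c] if true_counts[c] else 0.0
              for c in range(k)]

    # Chance agreement: sum of (true count x predicted count) over n squared.
    p_e = sum(a * b for a, b in zip(true_counts, pred_counts)) / (total * total)
    kappa = (accuracy - p_e) / (1.0 - p_e) if p_e < 1.0 else 0.0
    return accuracy, precision, recall, kappa


# Hypothetical two-class example with 20 samples.
acc, prec, rec, kappa = classification_metrics([[8, 2], [1, 9]])
```

For the example matrix, 17 of 20 samples lie on the diagonal, the chance agreement is (10 × 9 + 10 × 11) / 400 = 0.5, and the kappa score is therefore (0.85 − 0.5) / (1 − 0.5) = 0.7.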

Methods for comparison
We selected several advanced models for comparison. To ensure the diversity of methods, we selected the following networks: DenseNet121, 32 DeepWalk, 33 dual attention graph convolutional network (DAGCN), 29 DGCN, 34 Graph REsidual rE-ranking Network (GREEN), 35 and hybrid graph convolutional network (HGCN). 36
• DenseNet121. This network introduces the dense block and adds a convolutional layer and a pooling layer between consecutive dense blocks. In each dense block, any two layers are directly connected. Thus, the input of each layer is the union of the outputs of all the previous layers, and the feature map learned by a layer is passed directly to all subsequent layers as input. DenseNet121 greatly reduces the number of parameters and strengthens the reuse of features.
• DeepWalk. DeepWalk consists of two main steps: random walks and the generation of representation vectors. First, node sequences are sampled from the graph by a random walk algorithm; the vertices are treated as words in a language model, and the node sequences are treated as sentences. This method applies unsupervised representation learning to graphs and can create meaningful representations for large-scale graphs.
• DAGCN. DAGCN adopts a dual attention structure, which can extract information from different hops. The network uses a self-attention pooling technique to represent graph information as an embedding matrix, which preserves as much of the raw information behind the graph as possible. Since the attention mechanism is also used in our method, DAGCN is an important point of comparison.
• DGCN. The model adopts two simple parallel feedforward networks: local consistency convolution and global consistency convolution. The two branches differ only in the graph structure information they take as input, and the convolution parameters of the two parallel branches are shared. After the two branches, a loss function is added to combine local consistency and global consistency, which makes good use of the prior knowledge in the original data.

Experimental results
We implement our network in PyTorch. The network is trained with the stochastic gradient descent (SGD) optimizer, 37 with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate is set to 0.0001 and is divided by 10 every 20 epochs; the network is trained for 60 epochs with a batch size of 32. Hyperparameter α is set to 0.2. Within this framework, 10-fold cross-validation is adopted to improve the generalization ability. Figure 5 shows the training and testing curves of the proposed method on the APTOS dataset. Accuracy and loss begin to stabilize after the 32nd epoch.
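The step decay of the learning rate can be written out explicitly. The sketch below assumes zero-based epoch indices and that the decay is applied at epochs 20 and 40; the function name `step_lr` is hypothetical. In PyTorch, the same schedule is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.

```python
def step_lr(epoch, base_lr=1e-4, step_size=20, gamma=0.1):
    """Learning rate at a zero-based epoch: divided by 10 every 20 epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 60 training epochs.
schedule = [step_lr(epoch) for epoch in range(60)]
for start in (0, 20, 40):
    print(start, schedule[start])
```

Under these assumptions the run uses three plateaus of 20 epochs each, at 1e-4, 1e-5, and 1e-6.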
We set the labeling rate to 1%, 2%, 5%, 10%, 20%, 25%, 50%, and 100% for testing. Table 3 gives the classification accuracy at the different labeling rates. Accuracy is relatively low when the labeling rate is 1% or 2% and increases with the labeling rate. When the labeling rate is set to 100%, the model uses only labeled data for fully supervised training. In this setting, the classification accuracy of the graph attentional convolutional neural network (GACNN) reaches 93.35%, an improvement of 6.24% over DenseNet121 and of 4.9% over DAGCN.
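A random split of the training set at a given labeling rate can be sketched as follows. The sketch assumes a plain shuffle without stratification by class and keeps at least one labeled sample; the function name, the fixed seed, and the example set size are hypothetical.

```python
import random


def split_by_labeling_rate(sample_ids, rate, seed=0):
    """Randomly split sample ids into (labeled, unlabeled) at the given rate."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_labeled = max(1, round(len(ids) * rate))
    return ids[:n_labeled], ids[n_labeled:]


# Hypothetical training set of 1000 images at the labeling rates used above.
for rate in (0.01, 0.02, 0.05, 0.10, 0.20, 0.25, 0.50, 1.00):
    labeled, unlabeled = split_by_labeling_rate(range(1000), rate)
    print(rate, len(labeled), len(unlabeled))
```

At a rate of 1.00 the unlabeled set is empty, so training reduces to the fully supervised case.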
In addition, we present visualization results. Figure 6 shows the confusion matrices of these models. The colors on the diagonal represent the classification performance. In Figure 6, it can be clearly seen that the diagonal colors of DenseNet121, DeepWalk, and GCN are lighter. Compared with other categories, the classification error rates of the images labeled '2' and '3' are relatively high. Figure 7 shows the t-distributed stochastic neighbor embedding (t-SNE) 38 visualization of the feature maps output by each network. We use five colors to mark the five categories. After the high-dimensional data are embedded into two-dimensional space through t-SNE, the category information of the data is retained, and the different categories are clearly separated.

Influence of semi-supervised method on model performance
To verify the effectiveness of semi-supervised learning, we compared training with labeled data only against training with both labeled and unlabeled data. We set the labeling rate to 1%, 2%, 5%, 10%, 20%, 25%, and 50%, and fed only the labeled data (no unlabeled data) to our model. Table 4 compares these results with the accuracy in Table 3 of the method that uses labeled and unlabeled data simultaneously. It can be seen from Table 4 that the model using the semi-supervised learning method achieves higher accuracy. For example, the accuracy of the proposed method with 1% labeled data and 99% unlabeled data is 65.79%, which is higher than that obtained with 1% labeled data alone.

Effect of regularization weight
In formula (9), GACNN uses a hyperparameter to balance the two loss terms. In the following experiments, we evaluate the performance of GACNN with different settings of this hyperparameter. Figure 8 shows the experimental results under the different settings. Classification performance is best when the hyperparameter is set to 1e-1 or 1e-2.

Ablation study
To further verify the effectiveness of the network, we conducted ablation experiments on the network without the attention mechanism and on the GCN based on the attention mechanism. Table 5 compares the impact of each part on the classification performance of the network and analyzes the contribution of the CNN branch and the attention mechanism to the model. Compared with GCN, the accuracy of our method is improved by 5.71%, which is due to the combination of the global features extracted by the CNN and the spatial features extracted by the GCN. In addition, the introduction of the attention mechanism improves the adaptability of the GCN to the topology and increases the weight of target nodes. In general, the proposed method can effectively improve the classification accuracy and enhance the generalization ability of the model.

DISCUSSION
In this paper, we proposed GACNN, a model for classifying the lesion degree of fundus images. The main advantages of our proposed method are as follows. On the one hand, GACNN combines a CNN and a GCN to extract local and global features of fundus images and uses the attention mechanism to aggregate the graph information. On the other hand, we expanded the training set before training, which can improve the robustness of the model. In addition, the model improves the generalization ability of the network by combining two loss functions to guide the learning process. However, Table 3 shows that at certain labeling rates the classification results of our method are inferior to those of GREEN and HGCN, which may be due to the difference between their class dependency module and graph learning module and our design. In GREEN, the authors feed the image classes into a GCN as graph nodes and multiply the output class adjacency matrix by the feature vector output by the CNN to update the weights and output the predicted value. GREEN thus processes category information directly rather than image features. In HGCN, the idea of the graph learning module is to feed the image features extracted by the CNN into a GCN, use the GCN to build the adjacency matrix of nodes, and finally use a two-part loss function for learning.
FIGURE 9 Visual explanation of each model. Red represents higher weight, while blue represents lower weight
The performance of the model is verified by the experiments in the previous section. Table 3 compares the experimental results of the various models under different labeling rates. As the labeling rate increases, the performance of the model gradually improves. When the labeling rate is set to 100%, the classification accuracy, precision, and recall reach their highest values in the experiment. Apart from GREEN and HGCN, GACNN outperforms all other models.
Since labeled data require a huge annotation workload, semi-supervised learning is an efficient way to train a comparable model while also taking advantage of the unlabeled data. Table 4 shows the results of GACNN with and without semi-supervised learning. Adding the unlabeled data improves the accuracy over the model trained with the labeled data only, so we conclude that the semi-supervised learning method has the potential to improve classification performance.
In particular, our method uses an attention strategy similar to that of DAGCN but shows better performance on this dataset, which indicates that the combination of VGG and GCN is effective for improving classification performance. In Table 5, we compared the classification results of GACNN and its variants on the APTOS dataset. The results show that GACNN has the best accuracy. A GCN can automatically learn node characteristics and the association information between nodes. We construct the graph structure to capture the relationship between different lesion regions, so that the network can learn the differences between the lesion features of various images and classify the images better. The accuracy of our model is significantly higher than that of the model without the attention mechanism, which indicates the influence of the attention mechanism in the model. The attention mechanism helps the network identify local lesion areas, enhances the region of interest, and reduces the attention paid to irrelevant areas.
We visualize the features learned by the model and highlight the image areas that are important for classification to show how the classification decision is made. Figure 9 shows the Gradient-weighted Class Activation Map (Grad-CAM) visualization results of each model. Figures 6 and 7 show the visual experimental results of each model. The performance of our method is better than that of the other baselines. It can be seen from the figures that DenseNet121 and DeepWalk do not have clear classification boundaries for different classes of images. Because fundus images are complex and lesion types are diverse and similar, it is difficult for some networks to identify DR images with similar lesion types. Our model ameliorates this problem.
However, the classification performance of the model for the categories labeled 2 and 3 is poor, which may be due to the similarity of the two classes of image features. Images labeled 2 are mostly represented by the appearance of white spots, while images labeled 3 are mostly represented by the appearance of cotton flocculent leukoplakia in the fundus. However, white spots and bleeding spots often exist in images labeled 3, which leads to the similarity between the two classes of images.
Finally, we study the influence of different hyperparameters on network performance. As shown in Figure 8, when the loss-balancing hyperparameter is set to 0, the experimental results decrease significantly; in this case the loss function depends only on the first term, which is still feasible in the experiment.
However, our experiment still has some limitations. The amount of data in our experiment is insufficient, and the category distribution is uneven. Our study only considers the images after data augmentation. In the future, we intend to apply our model to more datasets. We will also take a deeper look at graph convolution and study whether our model can be applied to other research fields.

CONCLUSION
In this work, we propose a semi-supervised classification model based on CNN and GCN. In addition, we introduce the attention mechanism into the GCN branch to reduce the information loss in the traditional graph convolution step. We compared GACNN with other network models, and the results show that the proposed method can significantly improve the classification performance on DR images. Finally, we conducted ablation experiments to verify the attention mechanism and the effectiveness of feature fusion between the two networks. The combination of the two networks and the attention mechanism can effectively enhance the ability to extract the hierarchical information of an image.

Writing, review and editing. All authors contributed to revising the paper.

ACKNOWLEDGMENTS
This work was funded by the National Natural Science Foundation of China (61971271), the Jinan City-School Integration Development Strategy Project (JNSX2021023), and Shandong Province Major Technological Innovation Project (2022CXGC010502).

CONFLICT OF INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.